Parallel Programming/Lesson 3
- Types of parallel communication patterns
- Stencil - tasks read inputs from a fixed neighborhood in an array
- 2D von Neumann stencil: the cross-shaped neighborhood (center plus up/down/left/right) – see the sketch after this list
- 2D Moore stencil: all surrounding neighbors, including diagonals
- Transpose – tasks reorder data elements in memory (e.g., converting AoS to SoA)
- Array of Structures = AoS
- Structure of Arrays = SoA
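- A minimal sketch of a 2D von Neumann (cross-shaped) stencil kernel; the 5-point averaging and the names `in`, `out`, `width`, `height` are illustrative, not from the lecture:

```cuda
// Each thread averages its element with its four edge neighbors (von Neumann cross).
__global__ void vonNeumannStencil(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Skip the border so every neighbor read stays in bounds.
    if (x < 1 || x >= width - 1 || y < 1 || y >= height - 1) return;

    int idx = y * width + x;
    out[idx] = 0.2f * (in[idx]
                     + in[idx - 1]        // left
                     + in[idx + 1]        // right
                     + in[idx - width]    // up
                     + in[idx + width]);  // down
}
```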
- Efficiency
- Threads
- thread blocks – groups of threads that cooperate to solve a subproblem
- block (and grid) sizes can be picked per kernel launch
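- A sketch of picking a block size at launch time, assuming a simple element-wise kernel (`squareKernel` and the element count are made up for illustration); 256 threads per block is a common starting point, and the grid is rounded up to cover every element:

```cuda
// Hypothetical kernel: each thread handles one element.
__global__ void squareKernel(float* d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= d_data[i];
}

int main()
{
    int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Block size is chosen per kernel launch; round the grid up so all n elements are covered.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    squareKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```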
- Blocks
- Streaming multiprocessors: SMs
- small GPUs may have 1 SM, large ones 16 or more
- each SM contains many simple processors and some memory
- GPU allocates blocks to SMs
- All SMs run in parallel and independently
- No guarantees from CUDA about when or where a block will run
- the hardware can run blocks whenever (and on whichever SM) it wants
- all blocks must complete before the kernel finishes
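- A small sketch that makes the "no guarantees" point concrete: each block prints its index, and the printed order can change from run to run because the hardware schedules blocks however it likes (kernel name and block count are illustrative):

```cuda
#include <cstdio>

// Each block prints its index; the output order is not guaranteed.
__global__ void helloFromBlock()
{
    printf("hello from block %d\n", blockIdx.x);
}

int main()
{
    helloFromBlock<<<16, 1>>>();   // 16 blocks of 1 thread each
    cudaDeviceSynchronize();       // wait until every block has completed
    return 0;
}
```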
- GPU memory model
- global memory
- shared memory – at block level (shared by the threads of one block)
- host memory
- local memory
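- A sketch of where variables live in this model, assuming a single-block launch of at most 128 threads (names and sizes are illustrative):

```cuda
__device__ float d_globalScale;              // global memory: visible to all threads in all blocks

__global__ void memorySpaces(float* d_buf)   // d_buf points into global memory (from cudaMalloc)
{
    float localVal = d_buf[threadIdx.x];     // local memory: per-thread registers
    __shared__ float tile[128];              // shared memory: one copy per thread block
    tile[threadIdx.x] = localVal;
    __syncthreads();
    d_buf[threadIdx.x] = tile[threadIdx.x] + d_globalScale;
}
// Host memory (malloc/new on the CPU) is separate; data moves to and from
// global memory with cudaMemcpy.
```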
- Synchronization
- barrier – a point where all threads stop and wait until every thread reaches it
- Memory types
- Thread blocks are grouped into kernels; a kernel launch runs a grid of thread blocks
- There's an implicit barrier between kernels
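- A sketch of that implicit ordering between launches on the default stream: the second kernel does not start until every block of the first has finished (the kernels and sizes here are made up):

```cuda
__global__ void stepA(int* d_x) { d_x[threadIdx.x] += 1; }
__global__ void stepB(int* d_x) { d_x[threadIdx.x] *= 2; }

// Launches on the same (default) stream run in order: stepB starts only after
// every block of stepA has completed, so stepA's writes are visible to stepB.
void runSteps(int* d_x)   // d_x: device buffer of at least 64 ints (illustrative)
{
    stepA<<<1, 64>>>(d_x);
    stepB<<<1, 64>>>(d_x);
    cudaDeviceSynchronize();   // also wait before the host touches the result
}
```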
- Computation / memory spaces / synchronization == CUDA
- Maximize arithmetic intensity (math / memory)
- more useful compute (work) per thread
- less time spent on memory access per thread
- speed: local > shared >> global >>> CPU (host) memory
- local = registers and L1 cache
- kernel parameters and local variables live in local memory
- cudaMalloc, cudaMemcpy, cudaMemset
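- A minimal host-side sketch exercising those three calls (sizes and values are arbitrary):

```cuda
#include <cstdio>
#include <cstdlib>

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float* h_data = (float*)malloc(bytes);             // host memory
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, bytes);                // global (device) memory
    cudaMemset(d_data, 0, bytes);                      // zero it on the device

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host
    printf("h_data[42] = %f\n", h_data[42]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```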
- __shared__ variables are visible to the whole thread block
- __syncthreads() – a barrier for all threads in the block (see the sketch below)
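- A sketch that uses both: reversing a block-sized array through shared memory, where the barrier guarantees every element is staged before any thread reads one back (kernel name and the 256-element limit are illustrative):

```cuda
// Reverse a block-sized array via shared memory.
__global__ void reverseInBlock(float* d_data, int n)   // assumes n == blockDim.x <= 256
{
    __shared__ float tile[256];   // visible to every thread in this block
    int t = threadIdx.x;

    tile[t] = d_data[t];          // each thread stages one element
    __syncthreads();              // barrier: wait for the whole block to finish writing

    d_data[t] = tile[n - 1 - t];  // now it is safe to read another thread's element
}
```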
- Can also coalesce access to global memory
- adjacent threads read/write contiguous memory locations
- "coalesced access"
- atomicAdd, atomicMin, atomicCAS
- Limitations of atomics
- only certain operations (add, subtract, XOR, etc.) – no mod or exponentiation
- only some types are supported (mostly integers)
- atomicCAS can be used to implement essentially any atomic operation
- still no ordering constraints on which thread runs first
- floating-point arithmetic is not associative, so results can differ between runs
- no magic: atomic accesses are serialized, so contention makes them slow
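- A sketch of the usual atomic use case, a histogram with atomicAdd: many threads may hit the same bin, and the atomic serializes those updates so no increment is lost (the kernel and bin scheme are illustrative):

```cuda
__global__ void histogram(const int* d_in, int* d_bins, int n, int numBins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int bin = d_in[i] % numBins;   // assumes non-negative input values
        atomicAdd(&d_bins[bin], 1);    // serialized per bin; update order is undefined
    }
}
```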
- Avoid thread divergence
- divergence: threads in the same warp take different branches (see the sketch below)
- if/else where the condition depends on the thread index
- loops with different trip counts per thread
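- A sketch of warp divergence: when threads in one warp take different branches, the hardware runs the two paths one after the other, idling the threads on the other path (the kernel is illustrative):

```cuda
__global__ void divergentBranch(float* d_data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        d_data[i] = d_data[i] * 2.0f;   // even lanes run while odd lanes wait...
    } else {
        d_data[i] = d_data[i] + 1.0f;   // ...then odd lanes run while even lanes wait
    }
}
```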